In this paper, we consider capture-recapture experiments with heterogeneous catchability.
In the setting we consider, the widespread Huggins-Alho estimator is not very suitable and 
we introduce and study a new generalized Horvitz-Thompson estimator. Our motivation is 
Respondent Driven Sampling (RDS), a prime example for such a setting where the capture 
probability is dependent on both the unknown population size as well as on an observable 
covariate, the network degree of an individual, due to peer recruitment. After discussing 
the theoretical properties of the new estimator, with full details given in the appendix, we 
evaluate it on various empirical and simulated data-sets, focusing on an RDS survey in a 
population in rural Uganda in which the population size is known a priori. The results thus 
obtained demonstrate that the adjusted estimator is less biased than the naive Lincoln- 
Petersen estimator. 



1 Introduction 

When using naive capture-recapture methods to estimate the size of a population under
investigation, it is well known that if the population sampling is not homogeneous (i.e.,
heterogeneous catchability) the estimates can be biased. One way to overcome this problem is to
employ models in which detection probabilities are viewed as latent parameters described by some
probability distribution [21, 25] (for a review and criticism of this approach, see [3]).
An alternative approach, which is more relevant to the circumstances we are interested in,
is to model heterogeneity explicitly by identification of individual covariates that are thought
to explain variation in detection probability (e.g., size, weight and age, which are commonly
used in wildlife biology). For instance, an early paper by Pollock, Hines and Nichols put
forward a strategy of stratification of individuals into a finite number of classes, yielding $K$
strata with stratum population sizes $\{N_i\}_{i=1}^{K}$, where the collection of $N_i$ parameters
is the object of inference. One obvious shortcoming of this method (in addition to the increase in
parameter dimensions with the cardinality of the covariate) is that the covariates for the
uncaptured animals are not observable, and therefore Pollock et al. had to categorize all animals
into several groups and use the midpoint of the classifying covariate as a representative covariate
for the group. Another strategy avoids this difficulty through linear logistic modelling of capture
probabilities, using the individual covariates of the captured individuals, thus obviating the need
for the covariates of the uncaptured in the analysis. In this strategy, $N$ is a derived parameter,
its estimation is based on a generalized Horvitz-Thompson estimator (HT), and classical methods
of asymptotic inference are employed (for a comprehensive review and comparison with a third
approach, the so-called "joint likelihood" approach, which specifies the joint distribution of the
capture data and the covariate, see [23]).

Despite the justified widespread use of the Huggins-Alho estimator (HA), it is known to be
problematic if capture probabilities are low (see [3], and also the limitation in [1] (p. 626) on
the catchability from having a large number of small values). In addition, as Royle points out [23],
the Huggins-Alho estimator has the conceptually unappealing aspect that $N$ (the object
of inference) is a derived parameter, formulated explicitly as a function of nuisance parameters
(the capture probabilities, which are usually not of direct interest).

The above two shortcomings are particularly crucial for the case we are interested in here,
where the capture probabilities depend on the size of the population, and hence can also take
many small values. This contrasts with common scenarios in wildlife biology where, for example,
the size of an individual has a simple bearing on its capture probability irrespective of $N$, and
thus the HA estimator can sum the inverses of the (estimated) capture probabilities to produce
an estimate of $N$. One example of a situation where the capture probabilities depend on $N$
is respondent driven sampling (RDS) [7].

RDS is an approach to sampling design for "hidden" populations such as marginalized or
highly stigmatized populations (e.g., injection drug users, men who have sex with men, sex
workers). RDS utilizes the networks of social relationships that connect members of the target
population to facilitate sampling by chain referral methods. Although this is expected to result
in biased sampling (such as over-sampling participants with many acquaintances), most RDS
studies, interested, for example, in the prevalence of a disease within a population, typically
conduct a "single stage" survey and use the information about how members of the target
population are connected to weight recruits in a way that attempts to account for biased sampling
[7]. Although RDS studies typically have a single stage design, it is possible to use RDS
data as the second stage of a two-stage survey, such as in capture-recapture or the "multiplier
method", with the aim of estimating the size of a population (here we do not elaborate on the
differences between the two methods since our approach applies to both. Briefly however, in a
strict capture-recapture one has two random samples obtained by actively sampling twice. In
the "multiplier method", only one of the samples is random, such as the RDS sample, and the
other sample need not be random, e.g. service data). Almost a decade has passed since the first
application of RDS in a capture-recapture setting [10, 9], with little justification or analysis
since. Nevertheless, there has recently been a flourish of interest in this approach,
with additional RDS surveys completed recently, or underway, in more than 15 countries as
part of a two-stage survey designed to estimate the size of various hidden populations^1.

Due to peer recruitment, the sampling probabilities in RDS, $\{\pi_i\}_{i=1}^{N}$, are assumed to be
proportional to the degree, i.e. $\pi_i = \frac{d_i}{zN}$ (where $d_i$ is the degree of individual
$i$ and $z$ is the mean degree), though still unknown, because the size of the population is
unknown. Notice that although we may start with a more explicit form of the catchability than HA,
here we do not go further and attempt to estimate it exactly; only a proportionality relation is
required.

In this paper, we present the first detailed analysis of a generalized HT estimator which 
is highly applicable to cases such as those described in the previous paragraph. Although the 
sample design and the assumptions are a bit different than those used in the HA approach, they 
are not more restrictive and can be extended to many other cases. 

After introducing some notation and our new adjusted estimator in section 2.1, we start
by analyzing the behavior of a naive Lincoln-Petersen estimator (LP), before comparing it to
the adjusted estimator. In section 2.2 we show that the adjusted estimator has the desirable
properties of converging to the true population size in probability as $N \to \infty$, as well as
being indifferent to unknown capture probabilities in one of the two stages. In section 3 we test
the adjusted estimator on various empirical data-sets, focusing mostly on data from an RDS survey
in a population in rural Uganda in which the population size is known a priori (sec. 3.1), but
also, in section 3.2, on two additional data-sets. This can also be regarded as an examination of
the utility of RDS sampling design and weighting, which have been under some consideration
lately [5]. Our main findings in section 3 are that RDS sampling and the adjusted estimator
allow us to obtain a relatively accurate estimate of the population size (in contrast to the
naive approach). In section 4 we discuss our findings and the ways in which such an analysis
could be applied easily and without violating confidentiality (of non-RDS individuals) in similar
circumstances.



^1 For more detailed information, contact the corresponding author.
2 Theory and simulations

2.1 Setting and notations

Let $N$ be the (unknown) size of the population to be estimated. After sampling the population
twice independently, imagine that a researcher finds $a_{1,1}$ individuals who appeared in both
samples, $a_{1,0}$ individuals who appeared only in the first sample and $a_{0,1}$ individuals who
appeared only in the second sample. A naive estimate might be obtained by the Lincoln-Petersen
estimator:

$$\hat{N}_{naive} = \frac{(a_{1,0} + a_{1,1})(a_{0,1} + a_{1,1})}{a_{1,1}} \qquad (1)$$
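As a minimal sketch of the Lincoln-Petersen estimator above (the function name `lincoln_petersen` and its argument names are illustrative and do not come from the paper), the naive estimate can be computed directly from the three counts:

```python
def lincoln_petersen(a10, a01, a11):
    """Naive Lincoln-Petersen estimate of the population size.

    a10: number caught only in the first sample
    a01: number caught only in the second sample
    a11: number caught in both samples (must be positive)
    """
    if a11 <= 0:
        raise ValueError("no recaptures: the estimator is undefined")
    first_sample_size = a10 + a11
    second_sample_size = a01 + a11
    return first_sample_size * second_sample_size / a11
```

For instance, with 60 individuals seen only in the first sample, 70 only in the second and 30 in both, the two sample sizes are 90 and 100 and the call `lincoln_petersen(60, 70, 30)` evaluates 90 * 100 / 30.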

Unfortunately, even if the two samples are independent, unequal and positively correlated
catchability in both sampling stages results in an overrepresentation of "marked" individuals in
the sample and as a consequence a smaller $\hat{N}$ [11]; this situation may easily happen, for
example, if there are two (independent) RDS studies, or a "sampling stage" comprised of records of
people associated with a transmissible disease (where, in theory, the more friends you have the
more likely you are to be infected). A few papers [22] have taken a different ad-hoc
and heuristic approach. Making an assumption analogous to the standard capture-recapture
assumption, they assumed that the (weighted) proportion of individuals captured a second time
(compared to the weighted size of the second sample) is equal to the proportion of individuals
captured in the first sample (compared to $N$), that is:



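$$\frac{\sum_k I_k^{1,1}/\pi_k}{\sum_k I_k^{1,1}/\pi_k + \sum_k I_k^{0,1}/\pi_k} = \frac{\sum_k I_k^{1,0} + \sum_k I_k^{1,1}}{N} \qquad (2)$$

where $I_k^{i,j}$ is the indicator of whether individual $k$ was caught or not in each sample, and
$\pi_k$ is his catchability. Rearranging (2) we get the following estimator for $N$:

$$\hat{N}_{adj} = \Big(\sum_k I_k^{1,0} + \sum_k I_k^{1,1}\Big) \frac{\sum_k I_k^{1,1}/\pi_k + \sum_k I_k^{0,1}/\pi_k}{\sum_k I_k^{1,1}/\pi_k} \qquad (3)$$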
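The weighted estimator just described can be sketched in a few lines. This is only an illustration under stated assumptions: the two samples are given as sets of individual identifiers, and `weight` is a hypothetical mapping from identifier to any quantity proportional to the catchability in the second stage (for RDS, the reported degree). None of these names come from the paper.

```python
def adjusted_estimate(first, second, weight):
    """Weighted (generalized Horvitz-Thompson style) population size estimate.

    first, second: sets of identifiers caught in the first / second sample
    weight: dict mapping identifier -> value proportional to catchability
            (only needed for members of the second sample)
    """
    recaptured = first & second
    if not recaptured:
        raise ValueError("no recaptures: the estimator is undefined")
    # Weighted size of the whole second sample.
    weighted_second = sum(1.0 / weight[k] for k in second)
    # Weighted size of the recaptured part of the second sample.
    weighted_recaptured = sum(1.0 / weight[k] for k in recaptured)
    return len(first) * weighted_second / weighted_recaptured
```

With equal weights the expression reduces to the naive Lincoln-Petersen value; when the recaptured individuals have larger weights than the rest of the second sample, the estimate is pushed upward, which is the direction of the correction discussed in the text.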

If the $\pi_k$'s are not known, but we are able to measure the value of a covariate,
$\tilde{\pi}_k$, proportional to $\pi_k$, we can easily use it instead in (3). The case where the
catchabilities in both samples are uncorrelated, in particular when the sampling in the second
stage is uniformly at random, is well known to be trivial and similar to the case of equal
catchability [11]. Therefore we now focus on the case where they are positively correlated, and in
particular, using RDS as an example (where in both stages the probability that individual $k$ will
be sampled is proportional to his degree $d_k$), we start by showing why $\hat{N}_{naive}$ is a
good estimator (in expectation) when the catchability is more or less homogeneous, and why it is
not to be trusted when it is heterogeneous.

Assume that in the first stage (second stage) individuals are sampled until $S_1$ individuals
($S_2$) are caught, with

$$S_1 = a_{1,0} + a_{1,1} = \alpha_1 N$$
$$S_2 = a_{0,1} + a_{1,1} = \alpha_2 N$$

Taking the RDS assumption (recruitment is performed in a "random walk"-like manner, resulting
in a degree-dependent catchability equal to the stationary distribution of a random
walk on the network) with $\alpha_1 n / z$ as the correctly normalized probability for a person
with degree $n$ to be caught in the first sample (similarly, $\alpha_2 n / z$ is the probability of
capture in the second sample), the expected number of individuals from a population with a degree
distribution $\{p_n\}$ and mean degree $z := \sum_n n p_n$ that are caught twice is:

$$E\Big[\sum_k I_k^{1,1}\Big] = \sum_{k=1}^{N} \frac{\alpha_1 d_k}{z}\,\frac{\alpha_2 d_k}{z} = N \sum_n p_n \frac{\alpha_1 \alpha_2 n^2}{z^2} \qquad (4)$$

Thus,

$$E\Big[\sum_k I_k^{1,1}\Big] = \frac{N \alpha_1 \alpha_2 m_2}{z^2} \qquad (5)$$

where $m_2$ is the second moment of the degree distribution. Before evaluating
$E\big[1/\sum_k I_k^{1,1}\big]$ it is worth pointing out that in general $\sum_k I_k^{1,1}$ can be
equal to zero; however, capture-recapture studies where no one gets captured twice are trivial and
we therefore exclude this possibility from our considerations. Now, plugging (5) into Jensen's
inequality for a lower bound and into Kantorovich's inequality for an upper bound, with $S$
denoting $\min\{S_1, S_2\}$, the two combined and simplified give

$$\frac{z^2}{m_2} \le E\Big[\frac{\hat{N}_{naive}}{N}\Big] \le \frac{(S+1)^2}{4S}\,\frac{z^2}{m_2} \qquad (6)$$
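The intermediate steps are not legible in this copy, so the following is only a sketch of one way the two bounds can be obtained. It assumes that the naive estimator equals $S_1 S_2 / X$, where $X := \sum_k I_k^{1,1}$ (the symbol $X$ is introduced here for brevity), and that the number of recaptures satisfies $1 \le X \le S$:

```latex
\begin{align*}
E\big[\hat{N}_{naive}\big] &= S_1 S_2 \, E\Big[\frac{1}{X}\Big],
  \qquad X := \sum_k I_k^{1,1}, \quad 1 \le X \le S, \\
E\Big[\frac{1}{X}\Big] &\ge \frac{1}{E[X]}
  \qquad \text{(Jensen, since } x \mapsto 1/x \text{ is convex on } x > 0\text{)}, \\
E\Big[\frac{1}{X}\Big] &\le \frac{(S+1)^2}{4S} \cdot \frac{1}{E[X]}
  \qquad \text{(Kantorovich, for } X \in [1, S]\text{)}, \\
\frac{S_1 S_2}{E[X]} &= \frac{\alpha_1 \alpha_2 N^2 \, z^2}{N \alpha_1 \alpha_2 \, m_2}
  = \frac{N z^2}{m_2}.
\end{align*}
```

Dividing through by $N$ then gives a lower bound of $z^2/m_2$ and an upper bound of $\frac{(S+1)^2}{4S}\,\frac{z^2}{m_2}$ on $E[\hat{N}_{naive}/N]$. Since $m_2 \ge z^2$ always holds, the lower bound equals 1 only when all degrees are equal.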

Thus we conclude that for a homogeneous catchability (i.e., a degree distribution with
$m_2 \approx z^2$) a researcher using a naive LP estimator, and thus worried about an
underestimate, can be reassured by the lower bound, which is close to $N$. Quite the opposite is
true for heterogeneous networks (having relatively large $m_2$) in that the correction terms to $N$
might take very small values, yielding a gross underestimate (see figure 1).

2.2 Convergence and robustness of $\hat{N}_{adj}$

Turning to $\hat{N}_{adj}$, our first main result is that asymptotically $\hat{N}_{adj}$ converges
in probability to the true population size. The strategy for proving this is to notice that both in
the numerator of the right hand side of (3), as well as in the denominator, we have a sum of
Bernoulli-like random variables. Since a sum of such variables is concentrated close to its
expectation (as we show in the appendix using the method of bounded differences (lemma 1)), we can
find the expectation of each expression and conclude that their ratio is also concentrated close to
what we need.

In the process of proving theorem 1 we also find our second main result: as long as
we use the appropriate capture probabilities for one of the sampling stages, it does not matter
what the capture probabilities are for the other stage.

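Let $\pi_i := \frac{S_2}{N}\tilde{\pi}_i$ be the probability to sample individual $i$ at the
recapture stage with a sample size of $S_2$, with the $\tilde{\pi}_i$'s being constants associated
to an observable covariate. Instead of (3), we can (and should, from the point of view of a
researcher running the survey) use the $\tilde{\pi}_i$'s and be concerned with

$$\hat{N}_{adj} = \Big(\sum_k I_k^{1,0} + \sum_k I_k^{1,1}\Big) \frac{\sum_k I_k^{1,1}/\tilde{\pi}_k + \sum_k I_k^{0,1}/\tilde{\pi}_k}{\sum_k I_k^{1,1}/\tilde{\pi}_k} \qquad (7)$$

And now we have

Theorem 1. As $N \to \infty$, $\hat{N}_{adj}/N \to 1$ in probability.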
Proof. See appendix.

Remark 1: In general, we do not have convergence in expectation, i.e., $E[\hat{N}_{adj}/N] \to 1$
need not hold. However, if we restrict ourselves to cases where at least one individual is captured
twice (and thus $\hat{N}_{adj}$ is bounded from above by $N$) we have $E[\hat{N}_{adj}/N] \to 1$ as
well.



Remark 2: As a by-product, note that in the proof of theorem 1 we made no assumptions
about the capture probabilities in the first stage, apart from the fact that they sum up to
$\alpha_1 N$; thus we also have

Theorem 2. $\hat{N}_{adj}$ is indifferent to changed capture probabilities in the first stage.
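A heuristic calculation at the level of expectations (not the appendix proof, which is not reproduced here) may help to see why both statements are plausible. It assumes independent stages, second-stage capture probabilities of the form $\pi_k = \frac{S_2}{N}\tilde{\pi}_k$, and arbitrary first-stage capture probabilities $q_k$ with $\sum_k q_k = \alpha_1 N$; the symbol $q_k$ is introduced here for illustration only.

```latex
\begin{align*}
E\Big[\sum_k \big(I_k^{1,1} + I_k^{0,1}\big)/\tilde{\pi}_k\Big]
  &= \sum_{k=1}^{N} \frac{\pi_k}{\tilde{\pi}_k} = N \cdot \frac{S_2}{N} = S_2, \\
E\Big[\sum_k I_k^{1,1}/\tilde{\pi}_k\Big]
  &= \sum_{k=1}^{N} \frac{q_k \, \pi_k}{\tilde{\pi}_k}
   = \frac{S_2}{N} \sum_{k=1}^{N} q_k = \alpha_1 S_2, \\
S_1 \cdot \frac{S_2}{\alpha_1 S_2} &= \frac{\alpha_1 N}{\alpha_1} = N.
\end{align*}
```

The individual $q_k$ enter only through their sum, which is consistent with the indifference to first-stage capture probabilities stated in theorem 2. Turning this ratio of expectations into a statement about the estimator itself is what requires the concentration argument.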
2.3 Simulations 

To demonstrate the sensitivity of the naive estimator to heterogeneity in the capture
probabilities and compare it to $\hat{N}_{adj}$, we simulated a capture-recapture RDS-like design
where the probability of sampling is correlated between the capture and recapture stages, because
the probability of sampling individual $i$ is proportional to his degree $d_i$. We calculated the
naive and adjusted estimators for a population with a power-law degree distribution, over a range
of values of $\lambda$, which is a measure of the heterogeneity in degree between individuals. For
each $\lambda$, we sampled the population twice: in the capture phase, individuals were sampled in
relation to their degree, and in the recapture phase, individuals were sampled by an RDS-like
process; $\hat{N}_{naive}$ and $\hat{N}_{adj}$ were then calculated (figure 1). For large values of
$\lambda$, corresponding to homogeneous degrees, $\hat{N}_{naive}$ and $\hat{N}_{adj}$ were similar
and close to the true population size. However, as $\lambda$ decreased, the naive estimate
decreased and worsened, while $\hat{N}_{adj}$ was relatively constant across the range of
$\lambda$. If individuals have equal capture probabilities (but recapture is by RDS), then
$\hat{N}_{naive}$ and $\hat{N}_{adj}$ are similar, irrespective of $\lambda$ (not shown).
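A much simplified version of such a simulation can be sketched as follows. It is not the authors' simulation: no network is built and no RDS-like recruitment is simulated. Instead, both stages draw individuals without replacement with probability proportional to degree. The degree cap `d_max`, the exponent value 2.0, the function names and the use of the Efraimidis-Spirakis key trick for weighted sampling are all assumptions made for this illustration.

```python
import random

def power_law_degrees(n, lam, d_min=3, d_max=100, rng=random):
    """Draw n degrees with P(d) proportional to d**(-lam) on d_min..d_max."""
    support = list(range(d_min, d_max + 1))
    weights = [d ** (-lam) for d in support]
    return rng.choices(support, weights=weights, k=n)

def weighted_sample(weights, size, rng):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys)."""
    keys = sorted(((rng.random() ** (1.0 / w), i)
                   for i, w in enumerate(weights)), reverse=True)
    return {i for _, i in keys[:size]}

def simulate(n=2500, s1=100, s2=100, lam=2.0, reps=200, seed=0):
    """Return the mean naive and mean adjusted estimates over `reps` runs."""
    rng = random.Random(seed)
    naive, adjusted = [], []
    for _ in range(reps):
        deg = power_law_degrees(n, lam, rng=rng)
        first = weighted_sample(deg, s1, rng)    # degree-biased capture
        second = weighted_sample(deg, s2, rng)   # degree-biased recapture
        both = first & second
        if not both:                             # estimators undefined
            continue
        naive.append(s1 * s2 / len(both))
        total = sum(1.0 / deg[i] for i in second)
        recap = sum(1.0 / deg[i] for i in both)
        adjusted.append(s1 * total / recap)
    return sum(naive) / len(naive), sum(adjusted) / len(adjusted)

if __name__ == "__main__":
    mean_naive, mean_adj = simulate()
    print(f"mean naive: {mean_naive:.0f}, mean adjusted: {mean_adj:.0f}")
```

Under these assumptions the mean naive estimate should fall well below the population size of 2500, while the mean adjusted estimate should be larger. How close the adjusted value comes to 2500 depends on the number of recaptures and on how far sampling without replacement departs from exact proportionality.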



3 Evaluation on empirical data 



In this section we examine the theory on different data-sets, focusing on a survey from rural
Uganda (sec. 3.1) which is the most comprehensive, but also on two others [6, 30].
Figure 1: Naive and adjusted estimates for a population with a power-law degree distribution,
over a range of values of degree heterogeneity ($\lambda$). Mean and one standard deviation of
naive (blue) and adjusted (red) estimates of population size across a range of networks with
different levels of degree heterogeneity captured by the parameter $\lambda$, with high values
corresponding to networks with a relatively uniform degree, assuming sampling proportional to
degree during the capture stage. A total population size of $N = 2500$ and equal numbers of
individuals in the capture and recapture stages, $S_1 = S_2 = 100$, were used. Random networks were
generated according to a power-law distribution, $p_n \sim n^{-\lambda}$, with a minimum degree of
3. The recapture phase was simulated by an RDS-like process, in which a single seed individual was
sampled, and one of its network neighbours sampled; this is followed by random sampling of the
combined neighbours of these two nodes and so on, until the desired sample size was obtained. To
avoid recruitment rates being strongly affected by degree, we limited the number of recruits per
individual to 3. For each value of $\lambda$, 20 networks were constructed and on each one, the
process was run 50 times (i.e., 1000 repetitions per $\lambda$).
This section can also be considered an examination of RDS sampling and inference.
Accordingly, we mostly use different empirically obtained RDS trees (and combinations of them)
as a recapture stage. Starting with the Uganda data-set, we first use simple random (uniform)
sampling or degree-biased sampling from the general population as a capture stage (section
3.1). We then use records of association with an "age" category as the "capture" group.

Section 3.2 wraps up with an evaluation of the method on the Project 90 network, recently used
for an examination of the utility of RDS [6], and on a WebRDS survey from a large university
[30]. Using official institutional records of the number of students from different racial groups,
we were able to use them as a "capture" stage (similar to using the "age" data in sec. 3.1),
thus implementing the method in a simple manner.

3.1 Evaluation on the Uganda data-set 

The data used to define our first "target population" were available from an ongoing general 
population cohort of 25 villages in rural Masaka, Uganda. Annually, after obtaining consent, a 
total-population household census and an individual questionnaire are administered and blood 
taken for HIV-1 testing from this population. The target population consisted of $N = 2402$
men who were recorded as a male head of a household within the general population cohort 
study villages between February 2009 and January 2010. 

As part of an evaluation of RDS methodology, an RDS survey was carried out on this
population, employing current RDS methods of sampling and statistical inference (details are
available in full in [18]). Briefly, 10 seeds recruited 927 male household heads over 54 days
using RDS. The total number of recruits originating from each seed ranged from 8 to 241 (0.9%
to 26.3%). The number of waves ranged from 3 to 16. The mean degree of RDS recruits
(including seeds) was 12.1. Data were also collected on a simple random sample of 300 eligible
male household heads who were not recruited during the RDS. 54% (162/300) completed the
interview. The distribution of the reported degree of RDS recruits was approximately Gaussian
and showed likely over-reporting of multiples of 5 (fig. 2), similarly to the SRS sample of
non-recruits.

Figure 2: The distribution of the reported degree of RDS and SRS recruits.

Thanks to the extensive nature of the Uganda data-set we were able to assign degrees to all
individuals in the population using the degrees in the simple random sample (SRS). This was
done by concatenating the degree sequence in the SRS (162 individuals) until a degree sequence
of length 1475 (=2402-927) was obtained.
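The concatenation step can be illustrated with a short sketch. The degree values below are hypothetical placeholders, not the survey data, and the helper name `extend_degree_sequence` is introduced only for this example.

```python
from itertools import cycle, islice

def extend_degree_sequence(observed, target_length):
    """Repeat `observed` end-to-end and truncate to `target_length` entries."""
    if not observed:
        raise ValueError("observed degree sequence is empty")
    return list(islice(cycle(observed), target_length))

# Hypothetical stand-in for the 162 reported SRS degrees (not the real data).
srs_degrees = [5, 10, 12, 8, 15, 20] * 27           # 162 values
non_recruit_degrees = extend_degree_sequence(srs_degrees, 2402 - 927)
```

Each of the 1475 non-recruited individuals is thereby given a degree that was actually reported in the SRS, so the assigned degrees follow the empirical SRS distribution up to the truncation of the final repetition.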

For our first experiment we sampled randomly from the general population to obtain a
"capture" group and used the RDS data to obtain a "recapture" group. For the left column of
figure 3 (panels a, c, e and g) we used a simple random (uniform) sample of 250 individuals from
the general population as the capture stage, whereas for the right column of figure 3 (panels
b, d, f and h) we used a degree-biased sample of 250 individuals from the general population.
Having 10 different RDS trees, there is a total of $2^{10} - 1 = 1023$ different combinations of
trees possible as the recapture group. In addition, it is also possible to "bootstrap" and sample
uniformly at random from the 927 RDS participants a subsample of, say, 200 individuals. In
order to preserve clarity, figure 3 shows only the results of using combinations of 1, 2, 8 or 9
RDS trees as the recaptured group. For each possible combination of a single tree (top two
panels, a and b), a pair of trees (panels c and d), eight trees (panels e and f) or nine trees
(panels g and h) we evaluated $\hat{N}_{naive}$ (blue circles) and $\hat{N}_{adj}$ (red squares).
50 repetitions were done (of sampling from the general population) and the mean and an error bar of
one standard deviation are plotted vs the size of the combined group (i.e., number of individuals).
Note that, as expected, when sampling was done uniformly at random (a, c, e, g) both estimators
yielded a good estimate; however, when sampling was done in a degree-biased manner (b, d, f, h) the
naive estimator substantially underestimated the size of the population whereas the adjusted
estimator still provided a good estimate. Using other combinations of trees gave similar results
(data not shown).

Figure 3: Estimated population size for different combinations of RDS trees used as the
"recapture" sample (vs the size of the combined group). For the left column of figure 3 (panels
a, c, e and g) we used a simple random (uniform) sample of 250 individuals from the general
population as the capture stage, whereas for the right column of figure 3 (panels b, d, f and h)
we used a degree-biased sample of 250 individuals from the general population. In order to
preserve clarity, we only show the results of combinations of 1, 2, 8 or 9 RDS trees as the
recaptured group. For each possible combination of a single tree (top two panels, a and b), a pair
of trees (panels c and d), eight trees (panels e and f) or nine trees (panels g and h) we evaluated
$\hat{N}_{naive}$ (blue circles) and $\hat{N}_{adj}$ (red squares). 50 repetitions were done (of
sampling from the general population) and the mean and an error bar of one standard deviation are
plotted vs the size of the combined group (i.e., number of individuals). The black line shows the
true population size (2402).

In addition, we also tested the estimator with a "bootstrap" recapture group selected (uni- 
formly) at random from the RDS population, with the result shown in figure 4. It should be 
noted that we did not sample this recapture group (RDS subsample) according to its degree, 
and therefore it is very encouraging to see that weighting by the reported degrees improved the 
estimates almost to perfection. 

Figure 4: Estimated population size vs the size of the "capture" sample. After sampling $X$
individuals (where $X = 100, 200, \ldots, 1500$) from the general population uniformly at random
(thin lines) or with probability proportional to their degree (thick lines) we sampled 200
individuals from the 927 RDS participants (uniformly at random) as the "recapture" sample. For
each of the 15 cases we performed 50 repetitions and plotted the mean of the estimator with error
bars representing the standard deviation. Notice that $\hat{N}_{adj}$ (red) is robust with respect
to the catchabilities in the capture stage (theorem 2); i.e. it works well for both sampling
methods.

In many surveys, and in RDS in particular, there are rarely two different capture stages. In
fact, it could be argued that the "capture" stage in our simulations is not easily, or cheaply,
reproducible in practice. We therefore took a more realistic and challenging approach: using
records of association with an "age" category as the "capture" group, we reproduced the above
analysis. It should be noted, as discussed further in section 4, that this approach is not only
highly generalizable but also does not involve violation of the privacy of individuals not
participating in the "recapture" stage; the only information required from the "records" is the
number of recorded individuals.
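A minimal sketch of how such a records-based analysis could look in practice is given below. It assumes that each participant in the recapture sample reports a degree and whether he belongs to the recorded category; the function name `estimate_from_records` and the data layout are hypothetical, not taken from the paper.

```python
def estimate_from_records(list_size, sample):
    """Population size estimates using a records list as the capture stage.

    list_size: number of individuals on the records list (the only
               information needed from the records themselves)
    sample:    iterable of (degree, on_list) pairs, one per participant
               in the recapture sample
    Returns (naive_estimate, adjusted_estimate).
    """
    n_total = n_matched = 0
    w_total = w_matched = 0.0
    for degree, on_list in sample:
        n_total += 1
        w_total += 1.0 / degree          # inverse-degree weight
        if on_list:
            n_matched += 1
            w_matched += 1.0 / degree
    if n_matched == 0:
        raise ValueError("nobody in the sample is on the list")
    naive = list_size * n_total / n_matched
    adjusted = list_size * w_total / w_matched
    return naive, adjusted
```

Only the size of the list enters the calculation; membership is asked of the sampled participants themselves, so no individual-level information about people outside the sample is needed.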

Each age group in turn (0-19, 20-29, 30-39, 40-49 and 50+, with sizes 47, 484, 660, 496 and
714 respectively) was used as a "capture" stage. 300 individuals were then sampled uniformly
at random from the RDS sample as a "recapture" stage and estimates of $N$ were found; 100
repetitions were done for each age group and the mean and one standard deviation, $\sigma$, are
plotted vs the size of each group (fig. 5). The large overestimate based on the 0-19 age group data
(with its large variability; $\sigma_{naive} = 3231$ and $\sigma_{adj} = 6402$) is probably due not
only to the small size of the group (see, e.g., the large variance in fig. 4 for small "capture"
group size), but also, as an additional qualitative survey revealed [18], to the fact that young
unmarried "heads of households" were not considered eligible for recruitment by their peers, and
thus were undersampled. Estimates from the rest of the age groups are reasonable, surrounding
$N = 2402$, and when averaged over all such age groups yielded an overestimate of ~10% relative to
$N = 2402$.
Figure 5: Estimating the size of the population using the "age" category as a "capture" stage.
Each of the age groups 0-19, 20-29, 30-39, 40-49, 50+ (with true population sizes of 47, 484,
660, 496, 714 respectively) in turn was used as a "capture" stage. 300 individuals were then
sampled uniformly at random from the RDS sample as a "recapture" stage and estimates of
$N$ were found; 100 repetitions were done and the mean and one standard deviation, $\sigma$, are
plotted vs the size of each group. The large overestimate from the 0-19 age group (with its
large variability; $\sigma_{naive} = 3231$ and $\sigma_{adj} = 6402$) is probably due not only to
the small size of the group (see, e.g., the large variance in fig. 4 for small "capture" group
size), but also to young unmarried "heads of households" being under-recruited (see text).

The relatively similar behavior of both estimators can be attributed to the fact that
the degree distribution is rather homogeneous (cf. fig. 6, described in sec. 3.2, obtained from a
network with a more heterogeneous degree distribution), and, of course, to the fact that different
age groups might be less strongly correlated with degree than one might presume.



3.2 Additional data sets - Project 90 and WebRDS 

In addition to testing the method on the Uganda data, we also evaluated it on two additional
data-sets. Although not as extensive as the Uganda data, they complement each other in a
sense: the first, Project 90, consists of a detailed network (without an RDS sample), whereas
the second, the WebRDS data, is made up of a web-based respondent driven sample of students
of a large university (without details of the complete network).

The first source of data, Project 90, was a large study that began in 1987 and was designed
to examine the influence of network structure on the propagation of infectious disease
by constructing a network census ($N = 5475$) of high-risk heterosexuals in Colorado Springs
[14]. Recently, the Project 90 network was used to assess the utility of RDS [6]; we therefore
used it as our networked population, and attempted to infer its size by RDS and the adjusted
estimator. After simulating an RDS "capture" stage, as in sec. 2.3, with 300 individuals, we
sampled 300 individuals again as the "recapture" stage and evaluated $\hat{N}_{adj}$ and
$\hat{N}_{naive}$. This was done 10^ times and the distributions of $\hat{N}_{adj}$ and
$\hat{N}_{naive}$ were obtained (fig. 6). Note that, in agreement with the findings of ref. [6],
although an RDS adjusted estimator has a small bias ($N = 5475$, average $\hat{N}_{adj} = 5970$),
it does have a larger variance than the naive estimator. However, fig. 6 clearly shows that the
adjusted estimator is much better concentrated near the true population size.

Figure 6: The distribution of $\hat{N}_{adj}$ and $\hat{N}_{naive}$ for Project 90. After
simulating an RDS "capture" stage, as in sec. 2.3, with 300 individuals, we sampled 300 individuals
again as the "recapture" stage and evaluated $\hat{N}_{adj}$ and $\hat{N}_{naive}$. This was done
10^ times and the distributions of $\hat{N}_{adj}$ and $\hat{N}_{naive}$ were plotted.

The second source of data, the university WebRDS [30], is much smaller, with data available
only for the 378 participants in the sample. However, this sample, genuinely driven by the
respondents themselves, together with available official institutional records [29], allowed us to
conduct a proper examination of both estimators.

First, we used the race group "White" as a capture group, of size 6716 according to
institutional records, and checked how many white students (and with which degrees) appeared
in the WebRDS sample, thus obtaining $\hat{N}_{adj} = 11055$ and $\hat{N}_{naive} = 9437$. Then,
we performed the same procedure with the rest of the racial tags (table 1). While the large
overestimation using the "Hispanic" group looks discouraging, a possible explanation is that
only 8 Hispanics were "recaptured", probably because the initial convenience sample had only
white (7), black (1) and Asian (1) seeds. When "capturing" other race groups, most of the estimates
were much closer to the true population size ($N = 13510$ according to institutional records),
with a slight advantage to $\hat{N}_{adj}$. This advantage is further strengthened when taking into
consideration various problems in the data due to incomplete records, which led ref. [30] to
suggest that $N \approx 11750$ should be considered as the actual population size of known racial
affiliation.

Race          White    Asian    Hispanic    Black    non US
Enrolled       6716     2191         748      712      1073
In sample       269       58           8       14        14
N_naive        9437    14279       35343    19224     28971
N_adj         11055    10728       39482    13806     20552

Table 1: University statistics: estimating the number of students using racial groups as a
"capture" stage. $N = 13510$ in institutional records, effective $N \approx 11750$ [30], see text.
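The naive row of table 1 can be checked from the other two rows. The sketch below assumes, as the numbers suggest, that the naive estimate for each group is the enrolled count multiplied by the full WebRDS sample size of 378 and divided by the number of group members found in the sample. The adjusted row cannot be reproduced this way because it needs the individual reported degrees, which are not given in the text.

```python
SAMPLE_SIZE = 378  # number of WebRDS participants, as stated in the text

enrolled = {"White": 6716, "Asian": 2191, "Hispanic": 748,
            "Black": 712, "non US": 1073}
in_sample = {"White": 269, "Asian": 58, "Hispanic": 8,
             "Black": 14, "non US": 14}

# Naive (Lincoln-Petersen) estimate per "capture" group:
# list size * recapture sample size / number of matches.
naive = {group: enrolled[group] * SAMPLE_SIZE / in_sample[group]
         for group in enrolled}

for group, estimate in naive.items():
    print(f"{group:>8}: {estimate:8.0f}")
```

Groups with very few matches in the sample (8 or 14 individuals) give estimates that are highly sensitive to a change of even one match, which is consistent with the explanation offered in the text for the "Hispanic" row.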



4 Discussion 

Here we considered capture-recapture experiments with heterogeneous catchability. In contrast
to common practice, we consider (unknown) capture probabilities that depend on the size of 
the population (and thus many might take small values), while on the other hand we allow the 
assumption that they are related to an observed covariate in some manner. 

An important example for such setting, and the application we had in mind, is respondent 
driven sampling, where the probability to sample an individual is assumed to be proportional 
to his degree. In addition to the concern whether the degree of an individual has a direct 
bearing on its probability of being sampled, there are at least two other concerns regarding this 
assumption implicit in application of RDS: 

first, sampled individuals might incorrectly report their own degree, possibly causing some bias. 
Second, as noted in Q], if a large fraction of the population is sampled there will be a deviation 
from proportionality between degree and catchability. 

Indeed, errors in assessing one's own degree can be addressed by using, e.g., the scale-up 
method [17], and deviations from proportionality could be approached by a method in the 
spirit of [5]. However, deviations from proportionality are generally only problematic after a 
large fraction has been sampled; indeed, the simulations in [5] have a nice capture-recapture 
interpretation: after marking 20% of the population (with some preference according to degree), 
a subsequent RDS-like sample was obtained; even when 50% of the population was recaptured, 
a simple RDS weighting yielded a good estimate of the fraction of marked individuals. 
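
A capture-recapture reading of this kind of simulation can be sketched as follows. This is a toy illustration under our own assumptions, not a reproduction of the cited simulations: degrees are drawn uniformly from 1 to 20, both stages use independent (Poisson-type) sampling with capture probability proportional to degree, and the adjusted estimator is taken to be the first-stage sample size times the ratio of inverse-degree-weighted sums over the second sample and over the recaptured individuals. The function name `simulate_two_stage` and all parameter values are hypothetical.

```python
import random


def simulate_two_stage(N=20000, alpha1=0.2, alpha2=0.2, seed=0):
    """Return (Lincoln-Petersen estimate, degree-adjusted estimate) for one run."""
    rng = random.Random(seed)

    # ASSUMPTION: a hypothetical degree distribution (the observable covariate).
    degree = [rng.randint(1, 20) for _ in range(N)]
    mean_degree = sum(degree) / N

    # First stage ("marking"): capture probability increases with degree.
    first = [rng.random() < min(1.0, alpha1 * d / mean_degree) for d in degree]

    # Second stage: capture probability proportional to degree (RDS-like).
    second = [rng.random() < min(1.0, alpha2 * d / mean_degree) for d in degree]

    n1 = sum(first)
    n2 = sum(second)
    recaptured = [f and s for f, s in zip(first, second)]
    m = sum(recaptured)
    if m == 0:
        raise ValueError("no recaptures; the estimators are undefined")

    # Naive estimator: ignores heterogeneity in catchability.
    lincoln_petersen = n1 * n2 / m

    # ASSUMED form of the adjusted estimator: weight each second-stage
    # capture by the inverse of its degree.
    weighted_second = sum(1.0 / d for d, s in zip(degree, second) if s)
    weighted_recaptured = sum(1.0 / d for d, r in zip(degree, recaptured) if r)
    adjusted = n1 * weighted_second / weighted_recaptured

    return lincoln_petersen, adjusted


lp, adj = simulate_two_stage()
print(f"Lincoln-Petersen: {lp:.0f}, adjusted: {adj:.0f}, true N: 20000")
```

Because high-degree individuals are over-represented in both stages, recaptures are more frequent than under homogeneous sampling and the naive estimate falls below the true size, whereas the inverse-degree weights compensate for this in the toy model.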

Furthermore, although such corrections might be considered and applied, there is evidence 
suggesting this is not crucial; our analysis of the Uganda data, for example, gave encouragingly 
small bias when using the reported degrees, even for a sample of size 927/2402. 

Perhaps the most interesting experiment reported here involves using the "age" and "race" 
categories when estimating the size of the population (for, respectively, the Uganda and the 
WebRDS data-sets). While the idea of using various patient records or lists in "capture-recapture" 
studies is not new, it is usually employed for ascertaining the number of people with a certain 
medical condition (related directly to being on such a list) and not the size of the general 
population at risk (see [J] and references therein). On the other hand, the WHO/UNAIDS 
'Guidelines on Estimating the Size of Populations Most at Risk to HIV' [28], for example, embraces 
the use of capture-recapture methods but strongly advocates against violating the assumptions of 
the simple model (i.e., homogeneous catchability, and excluding "non-random" incidence lists). 
Nevertheless, as suggested by Theorem 2 and demonstrated in the experiment in sec. 3, even if 
the first capture stage is non-random we should still be able to obtain a good estimate. It is also 
worth noting that there is no violation of individuals' privacy, since the only information required 
from the records is the number of registered individuals. Thus, the method suggested here could 
easily be adapted and used for other important applications in many diverse scenarios utilizing 
a "capture" stage, as in sec. 3.2 or as in the "key-chain" method [23], where there is no a priori 
guarantee that individuals will be sampled randomly in a uniform manner. 



5 Appendix 

Theorems 1 and 2 are derived in detail below. 
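Recall the following inequality by McDiarmid: 



Lemma 1. [McDiarmid]. Let $X_1, X_2, \ldots, X_N$ be real-valued random variables and let 
$\mathbf{X}$ denote the random vector $[X_1, X_2, \ldots, X_N]$. Consider a function 
$f : \mathbb{R}^N \to \mathbb{R}$ and suppose there exists a constant, $c$, such that 

\[
\Big| \mathbb{E}\big[ f(\mathbf{X}) \mid [X_1, X_2, \ldots, X_{k-1}] = [x_1, x_2, \ldots, x_{k-1}],\ X_k = x_k \big]
 - \mathbb{E}\big[ f(\mathbf{X}) \mid [X_1, X_2, \ldots, X_{k-1}] = [x_1, x_2, \ldots, x_{k-1}],\ X_k = x'_k \big] \Big| \le c
\]

for each $k = 1, 2, \ldots, N$ and $x_i$ ($i = 1, 2, \ldots, k$). Then for any $t > 0$ 

\[
\mathbb{P}\big( |f(\mathbf{X}) - \mathbb{E}[f(\mathbf{X})]| > t \big) \le 2 \exp\!\Big( \frac{-2t^2}{N c^2} \Big).
\]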

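Denote by $I_k^{(*,1)}$ the indicator function of whether individual $k$ was captured at the second 
stage, regardless of the first stage. Considering the sequence of $N$ random variables, 
$\{X_k\}_{k=1}^{N}$, with $X_k := I_k^{(*,1)}/\pi_k$, and the function 
$f(\mathbf{X}) = \sum_{k=1}^{N} X_k$, we have 

\[
\mathbb{E}[f(\mathbf{X})] = \mathbb{E}\Big[ \sum_{k=1}^{N} X_k \Big]
 = \sum_{k=1}^{N} \frac{\mathbb{E}\big[ I_k^{(*,1)} \big]}{\pi_k}
 = \sum_{k=1}^{N} \frac{\alpha_2 \pi_k}{\pi_k}
 = \alpha_2 N. \tag{8}
\]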

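Without loss of generality, we can assume that $\min_i(\pi_i) = 1$, so that the $X_k$ have a 
mutual bound $c = 1$, as required by Lemma 1, and obtain 



Lemma 2. $\sum_{k=1}^{N} X_k$ is concentrated near $\alpha_2 N$ with 

\[
\mathbb{P}\Big( \Big| \sum_{k=1}^{N} X_k - \alpha_2 N \Big| > \sqrt{N \log N} \Big) \le \frac{2}{N^2}.
\]

Proof. Apply Lemma 1, with $f$ and $c$ as above and $t = \sqrt{N \log N}$. 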
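
For concreteness, the tail bound behind this application of Lemma 1 can be written out. This is a sketch that assumes the exponent in Lemma 1 has the standard bounded-differences form $-2t^2/(Nc^2)$ and that the deviation is taken as $t = \sqrt{N \log N}$, the quantity that appears in the bounds of Theorem 1; with $c = 1$ it gives

```latex
\begin{align*}
\mathbb{P}\Big( \Big| \sum_{k=1}^{N} X_k - \mathbb{E}\Big[ \sum_{k=1}^{N} X_k \Big] \Big| > \sqrt{N \log N} \Big)
  &\le 2 \exp\!\Big( \frac{-2\, N \log N}{N \cdot 1^2} \Big) \\
  &= 2 \exp(-2 \log N) = \frac{2}{N^2}.
\end{align*}
```

Under the same assumptions, a union bound over the two weighted sums entering the estimator leaves a failure probability of at most $4/N^2$.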

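Similarly, denoting by $\beta_i$'s the capture probabilities in the first stage, writing 
$I_k^{(1,1)}$ for the indicator that individual $k$ was captured at both stages, and using only 
the fact that the $\beta_i$'s sum up to $\alpha_1 N$, we have 

\[
\mathbb{E}\Big[ \sum_{k=1}^{N} \frac{I_k^{(1,1)}}{\pi_k} \Big]
 = \sum_{k=1}^{N} \beta_k \, \alpha_2 \pi_k \, \frac{1}{\pi_k}
 = \alpha_2 \sum_{k=1}^{N} \beta_k
 = \alpha_1 \alpha_2 N \tag{9}
\]

and obtain 

Lemma 3. $\sum_{k=1}^{N} I_k^{(1,1)}/\pi_k$ is concentrated near $\alpha_1 \alpha_2 N$ with 

\[
\mathbb{P}\Big( \Big| \sum_{k=1}^{N} \frac{I_k^{(1,1)}}{\pi_k} - \alpha_1 \alpha_2 N \Big| > \sqrt{N \log N} \Big) \le \frac{2}{N^2}.
\]

Proof. The same as for Lemma 2 above. 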
Our first main result is: 



Theorem 1. As $N \to \infty$, $N_{\mathrm{adj}}/N \to 1$ 

(a) with 

\[
\mathbb{P}\Big(
 \frac{\alpha_1 \alpha_2 N - \alpha_1 \sqrt{N \log N}}{\alpha_1 \alpha_2 N + \sqrt{N \log N}}
 \le \frac{N_{\mathrm{adj}}}{N} \le
 \frac{\alpha_1 \alpha_2 N + \alpha_1 \sqrt{N \log N}}{\alpha_1 \alpha_2 N - \sqrt{N \log N}}
 \Big) \ge 1 - \frac{4}{N^2},
\]

(b) and in expectation, i.e., $\mathbb{E}[N_{\mathrm{adj}}/N] \to 1$. 

Proof. (a) is obtained via a simple combination of Lemmata 2 and 3. As for (b), recall we 
have at least one individual captured twice and thus $N_{\mathrm{adj}}$ is bounded from above by 
$N^2$. Now (b) follows from (a) easily. 
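
As a numerical illustration of part (a), the sketch below evaluates the two endpoints of the interval for increasing $N$. It reads the lower endpoint as $(\alpha_1\alpha_2 N - \alpha_1\sqrt{N \log N})/(\alpha_1\alpha_2 N + \sqrt{N \log N})$ and the upper endpoint as $(\alpha_1\alpha_2 N + \alpha_1\sqrt{N \log N})/(\alpha_1\alpha_2 N - \sqrt{N \log N})$, and takes the logarithm to be natural; both readings, the function name `theorem1_bounds` and the sampling fractions used are assumptions made for illustration.

```python
import math


def theorem1_bounds(N, alpha1, alpha2):
    """Endpoints of the interval for N_adj / N, as read from Theorem 1(a)."""
    deviation = math.sqrt(N * math.log(N))  # natural logarithm assumed
    centre = alpha1 * alpha2 * N
    if centre <= deviation:
        raise ValueError("N too small: the upper endpoint is undefined")
    lower = (centre - alpha1 * deviation) / (centre + deviation)
    upper = (centre + alpha1 * deviation) / (centre - deviation)
    return lower, upper


# Hypothetical sampling fractions of 20% at each stage.
for N in (10**4, 10**6, 10**8):
    lower, upper = theorem1_bounds(N, alpha1=0.2, alpha2=0.2)
    print(f"N = {N:>9}: {lower:.4f} <= N_adj/N <= {upper:.4f}")
```

Both endpoints approach 1 as $N$ grows, at a rate governed by $\sqrt{\log N / N}$ divided by $\alpha_1 \alpha_2$, so small sampling fractions require a larger population before the interval becomes tight.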

Noticing we have made no assumptions about the $\beta_i$'s, the capture probabilities in the first 
stage, apart from the fact that they sum up to $\alpha_1 N$, we also have 

Theorem 2. $N_{\mathrm{adj}}$ is robust with respect to different choices of the capture 
probabilities in the first stage, $\{\beta_i\}_{i=1}^{N}$. 